Linear Regression

# Read in data
wine = read.csv("C:/Users/pySag/Desktop/MITx15.071x/Datasets/wine.csv")
str(wine)
## 'data.frame':    25 obs. of  7 variables:
##  $ Year       : int  1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
##  $ Price      : num  7.5 8.04 7.69 6.98 6.78 ...
##  $ WinterRain : int  600 690 502 420 582 485 763 830 697 608 ...
##  $ AGST       : num  17.1 16.7 17.1 16.1 16.4 ...
##  $ HarvestRain: int  160 80 130 110 187 187 290 38 52 155 ...
##  $ Age        : int  31 30 28 26 25 24 23 22 21 20 ...
##  $ FrancePop  : num  43184 43495 44218 45152 45654 ...
# 1. Year gives the year the wine was produced, a unique identifier for each observation.

# 2. Price ---> Dependent variable we are trying to predict.

# 3. Independent variables := WinterRain, AGST, HarvestRain, Age & FrancePop.
    # 3.1 We'll use these to predict the price.
# Linear Regression (one variable)

summary(wine)
##       Year          Price         WinterRain         AGST      
##  Min.   :1952   Min.   :6.205   Min.   :376.0   Min.   :14.98  
##  1st Qu.:1960   1st Qu.:6.519   1st Qu.:536.0   1st Qu.:16.20  
##  Median :1966   Median :7.121   Median :600.0   Median :16.53  
##  Mean   :1966   Mean   :7.067   Mean   :605.3   Mean   :16.51  
##  3rd Qu.:1972   3rd Qu.:7.495   3rd Qu.:697.0   3rd Qu.:17.07  
##  Max.   :1978   Max.   :8.494   Max.   :830.0   Max.   :17.65  
##   HarvestRain         Age         FrancePop    
##  Min.   : 38.0   Min.   : 5.0   Min.   :43184  
##  1st Qu.: 89.0   1st Qu.:11.0   1st Qu.:46584  
##  Median :130.0   Median :17.0   Median :50255  
##  Mean   :148.6   Mean   :17.2   Mean   :49694  
##  3rd Qu.:187.0   3rd Qu.:23.0   3rd Qu.:52894  
##  Max.   :292.0   Max.   :31.0   Max.   :54602
# The code below fits a one-variable linear regression model,
    # using AGST as our one independent variable to predict Price.

    # We will use lm(), offered by R; lm stands for "linear model".

    # This is the standard function for building linear regression models in R.

    # 1st argument : formula,

          # 2nd argument : data frame, in this case our "wine" data frame.

model1 = lm(Price ~ AGST, data=wine)
summary(model1)
## 
## Call:
## lm(formula = Price ~ AGST, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78450 -0.23882 -0.03727  0.38992  0.90318 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.4178     2.4935  -1.371 0.183710    
## AGST          0.6351     0.1509   4.208 0.000335 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4993 on 23 degrees of freedom
## Multiple R-squared:  0.435,  Adjusted R-squared:  0.4105 
## F-statistic: 17.71 on 1 and 23 DF,  p-value: 0.000335
# Let's break it down :

      # Call : a description of our model.       # Residuals ~ errors.
      # Description of the coefficients of our model.
            # 1st row : the intercept term.
            # 2nd row : the independent variable, AGST, that we passed in to our linear model.

            # 1st col : Estimate, gives estimates of the "beta" values of our model.

                  # More specifically, beta0, the coefficient of the intercept term.
                  # beta0 := -3.4

                  # beta1 := 0.635 ; the coefficient of our independent variable (AGST). 
            
      # The 3rd-last output is another thing we are interested in :

            # Residual standard error := 0.4993 on 23 d.o.f. (degrees of freedom).

      # 2nd-last output : Multiple R-squared := 0.435

            # & Adjusted R-squared := 0.4105

                 # This adjusts the R-squared value to account for the number of independent variables used relative to the number of data points.

            # [Note1] : Multiple R-squared never decreases when we add more independent variables, but Adjusted R-squared will decrease if the added variable is of no use to the model.

            # [Note2] : This is a good way to determine whether an additional variable is worth including in the model or not.
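To make the adjustment concrete, here is a quick sketch (assuming the standard adjusted R-squared formula, with n observations and k independent variables) that recovers the value reported above:

```r
# Assumed standard formula: adj.R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
n  <- 25      # observations in the wine data frame
k  <- 1       # one independent variable (AGST)
R2 <- 0.435   # Multiple R-squared from summary(model1)
1 - (1 - R2) * (n - 1) / (n - k - 1)
## [1] 0.4104348
```

This matches the reported 0.4105 up to the rounding of R2.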

            
# ( Sum of Squared Errors )
# Our residuals (errors) are stored in the vector model1$residuals.

model1$residuals
##           1           2           3           4           5           6 
##  0.04204258  0.82983774  0.21169394  0.15609432 -0.23119140  0.38991701 
##           7           8           9          10          11          12 
## -0.48959140  0.90318115  0.45372410  0.14887461 -0.23882157 -0.08974238 
##          13          14          15          16          17          18 
##  0.66185660 -0.05211511 -0.62726647 -0.74714947  0.42113502 -0.03727441 
##          19          20          21          22          23          24 
##  0.10685278 -0.78450270 -0.64017590 -0.05508720 -0.67055321 -0.22040381 
##          25 
##  0.55866518
SSE = sum(model1$residuals^2)
SSE
## [1] 5.734875
# Linear Regression (two variables)
    # To add another independent variable, use a "+" sign followed by the desired independent variable of your choice.

model2 = lm(Price ~ AGST + HarvestRain, data=wine)
summary(model2)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88321 -0.19600  0.06178  0.15379  0.59722 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.20265    1.85443  -1.188 0.247585    
## AGST         0.60262    0.11128   5.415 1.94e-05 ***
## HarvestRain -0.00457    0.00101  -4.525 0.000167 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared:  0.7074, Adjusted R-squared:  0.6808 
## F-statistic: 26.59 on 2 and 22 DF,  p-value: 1.347e-06
# Interpretation of our model:
      
      # (New) 3rd row : "HarvestRain" is added to our model.
          # Coefficient estimate := -0.00457

      # Lastly, "Multiple R-squared" & "Adjusted R-squared" have both improved.
          # This means adding the new independent variable has helped our model.
          
      # (Conclusion) : The new model is "better" than our previous model.

# Sum of Squared Errors
SSE = sum(model2$residuals^2)
SSE
## [1] 2.970373
# (inference):
    # The new SSE (2.97) is lower than the old SSE (5.73), a clear improvement.

#=============================================================#

# Linear Regression (all variables)
model3 = lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data=wine)
summary(model3)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age + 
##     FrancePop, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48179 -0.24662 -0.00726  0.22012  0.51987 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.504e-01  1.019e+01  -0.044 0.965202    
## AGST         6.012e-01  1.030e-01   5.836 1.27e-05 ***
## HarvestRain -3.958e-03  8.751e-04  -4.523 0.000233 ***
## WinterRain   1.043e-03  5.310e-04   1.963 0.064416 .  
## Age          5.847e-04  7.900e-02   0.007 0.994172    
## FrancePop   -4.953e-05  1.667e-04  -0.297 0.769578    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3019 on 19 degrees of freedom
## Multiple R-squared:  0.8294, Adjusted R-squared:  0.7845 
## F-statistic: 18.47 on 5 and 19 DF,  p-value: 1.044e-06
# (inference):
    # We've got 3 new rows, for the 3 new independent variables.

    # Our residual standard error has slightly decreased from the former model.

    # Multiple R-squared & Adjusted R-squared have increased, courtesy of the new variables added to the model.

# Sum of Squared Errors
SSE = sum(model3$residuals^2)
SSE
## [1] 1.732113
# (inference) :

    # Our SSE has again improved significantly over the previous model!

#[Quiz]: Create a linear regression model to predict Price using "HarvestRain" & "WinterRain" as independent variables, and answer the following questions:

model4 <- lm(Price ~ HarvestRain + WinterRain, data = wine)
summary(model4)
## 
## Call:
## lm(formula = Price ~ HarvestRain + WinterRain, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0933 -0.3222 -0.1012  0.3871  1.1877 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.865e+00  6.616e-01  11.888 4.76e-11 ***
## HarvestRain -4.971e-03  1.601e-03  -3.105  0.00516 ** 
## WinterRain  -9.848e-05  9.007e-04  -0.109  0.91392    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5611 on 22 degrees of freedom
## Multiple R-squared:  0.3177, Adjusted R-squared:  0.2557 
## F-statistic: 5.122 on 2 and 22 DF,  p-value: 0.01492
#==================

# Video 5 : Understanding the Model and Coefficients


# 1st column: ( The Estimate )
  
#        - Gives the coefficient for the intercept, & for each of the independent variables in our model.

#        - A coefficient of (essentially) 0 means the variable contributes nothing to the model.

#        - If so, we can remove that variable from our model,
#              - since it is not contributing to the performance of our model.

# 2nd column: ( The Standard Error )

#      - Measures how much the coefficient is likely to vary from the estimate.

# 3rd column: ( The t value )

#        - t value = Estimate / Std. Error
    
#        - Negative if the estimate is negative, & vice-versa.

#        - The larger |t value| is, the more likely the coefficient is significant.

#        - Thus we want independent variables with a large |t value| in this column.

# 4th column: Pr(>|t|)

#        - Measures how plausible it is that the coefficient is actually 0, given the data.

#        - The smaller it is, the less likely our coefficient is truly 0.

#        - This number is large when |t value| is small, & vice-versa.

#        - We want independent variables with a small value in this column.
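As a quick sanity check (a minimal sketch using the rounded model1 numbers from above), the t value column is just the ratio of the first two columns:

```r
# t value = Estimate / Std. Error, e.g. for the AGST coefficient in model1:
0.6351 / 0.1509
## [1] 4.208748
# The full coefficient table is also available programmatically:
coef(summary(model1))   # columns: Estimate, Std. Error, t value, Pr(>|t|)
```

The summary shows 4.208 because it divides the unrounded values.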


# [Note] : The easiest way to determine if a variable is significant is to look at the codes at the end of each row :-

#        - ***         :   Highest level of significance, p-value < 0.001

#        - **          :   Very significant, p-value between 0.001 & 0.01
      
#        - *           :   Significant, p-value between 0.01 & 0.05.

#        - A dot(.)    :   Almost significant, p-value between 0.05 & 0.10. 

# If we look at our coefficient table, Age & FrancePop are both insignificant.
 

# Let's remove FrancePop from our model (keeping Age for now).

model5 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)

summary(model5)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45470 -0.24273  0.00752  0.19773  0.53637 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.4299802  1.7658975  -1.942 0.066311 .  
## AGST         0.6072093  0.0987022   6.152  5.2e-06 ***
## HarvestRain -0.0039715  0.0008538  -4.652 0.000154 ***
## WinterRain   0.0010755  0.0005073   2.120 0.046694 *  
## Age          0.0239308  0.0080969   2.956 0.007819 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.295 on 20 degrees of freedom
## Multiple R-squared:  0.8286, Adjusted R-squared:  0.7943 
## F-statistic: 24.17 on 4 and 20 DF,  p-value: 2.036e-07
# Adjusted R-squared has slightly improved.
# Interestingly, Age, which was insignificant earlier, has now become significant.
  # Why is it so?

  # Answer: It's due to multicollinearity, i.e. Age & FrancePop are highly correlated.

# To check our assumption, let's build another model and test this claim.
model6 <- lm(Price ~ AGST + HarvestRain + WinterRain + FrancePop, data = wine)
summary(model6)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain + FrancePop, 
##     data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48252 -0.24636 -0.00699  0.22089  0.51949 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.768e-01  2.180e+00  -0.173 0.864529    
## AGST         6.011e-01  9.898e-02   6.073 6.17e-06 ***
## HarvestRain -3.958e-03  8.518e-04  -4.646 0.000156 ***
## WinterRain   1.042e-03  5.070e-04   2.055 0.053202 .  
## FrancePop   -5.075e-05  1.704e-05  -2.978 0.007434 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2943 on 20 degrees of freedom
## Multiple R-squared:  0.8294, Adjusted R-squared:  0.7952 
## F-statistic:  24.3 on 4 and 20 DF,  p-value: 1.945e-07
# It looks like our claim is true: with Age removed, FrancePop becomes significant.

Correlation & Multicollinearity

Correlation is a measure of the linear relationship between variables.

Multicollinearity, on the other hand, is when two independent variables are highly correlated.

  • +1 := perfect positive linear relationship.

  • 0 := no linear relationship.

  • -1 := perfect negative linear relationship.

cor(wine$WinterRain, wine$Price)
## [1] 0.1366505
cor(wine$Age, wine$FrancePop)
## [1] -0.9944851
cor(wine)
##                    Year      Price   WinterRain        AGST HarvestRain
## Year         1.00000000 -0.4477679  0.016970024 -0.24691585  0.02800907
## Price       -0.44776786  1.0000000  0.136650547  0.65956286 -0.56332190
## WinterRain   0.01697002  0.1366505  1.000000000 -0.32109061 -0.27544085
## AGST        -0.24691585  0.6595629 -0.321090611  1.00000000 -0.06449593
## HarvestRain  0.02800907 -0.5633219 -0.275440854 -0.06449593  1.00000000
## Age         -1.00000000  0.4477679 -0.016970024  0.24691585 -0.02800907
## FrancePop    0.99448510 -0.4668616 -0.001621627 -0.25916227  0.04126439
##                     Age    FrancePop
## Year        -1.00000000  0.994485097
## Price        0.44776786 -0.466861641
## WinterRain  -0.01697002 -0.001621627
## AGST         0.24691585 -0.259162274
## HarvestRain -0.02800907  0.041264394
## Age          1.00000000 -0.994485097
## FrancePop   -0.99448510  1.000000000
model7 <- lm(Price ~ AGST + HarvestRain + WinterRain, data = wine)
summary(model7)
## 
## Call:
## lm(formula = Price ~ AGST + HarvestRain + WinterRain, 
##     data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67472 -0.12958  0.01973  0.20751  0.63846 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.3016263  2.0366743  -2.112 0.046831 *  
## AGST         0.6810242  0.1117011   6.097 4.75e-06 ***
## HarvestRain -0.0039481  0.0009987  -3.953 0.000726 ***
## WinterRain   0.0011765  0.0005920   1.987 0.060097 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.345 on 21 degrees of freedom
## Multiple R-squared:  0.7537, Adjusted R-squared:  0.7185 
## F-statistic: 21.42 on 3 and 21 DF,  p-value: 1.359e-06

Inference :

  • It turns out that our model looks pretty much the same, except for one thing!

    • Our R-squared dropped to 0.75, vs 0.83 for the model that included Age.

  • The question is: should we keep Age or FrancePop?

    • Removing Age & FrancePop at the same time caused us to miss out on a significant variable.

    • Also, our R-squared value is lowered.

Conclusion: It turns out it's better to keep Age rather than FrancePop since, intuitively, "older wines are typically more expensive", so Age makes more sense in the model.

Typically, a correlation > 0.7 or < -0.7 is cause for concern.
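One way to apply that rule of thumb is to scan the correlation matrix for off-diagonal entries beyond ±0.7; a small sketch:

```r
# flag each pair of variables (upper triangle only) with |correlation| > 0.7
cm <- cor(wine)
which(abs(cm) > 0.7 & upper.tri(cm), arr.ind = TRUE)
# for the wine data this flags Year/Age, Year/FrancePop & Age/FrancePop
```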


Making Predictions

Our wine model is quite good at predicting the "Price" of wine on the data it has seen. But how well does it perform on "new data"?

  • For starters, our wine model had a value of R^2 := 0.83.
  • Wine buyers profit from being able to predict the quality of wine years before it matures.
  • The data that we used to build the model –> training data.

  • The new data –> test data.

  • Accuracy of the model on the test data –> out-of-sample accuracy.

Let’s jump in!

# We have two data points that we haven't used to build our model.
# These are contained in the file "wine_test.csv".

wineTest <- read.csv("C:/Users/pySag/Desktop/MITx15.071x/Datasets/wine_test.csv")
str(wineTest)
## 'data.frame':    2 obs. of  7 variables:
##  $ Year       : int  1979 1980
##  $ Price      : num  6.95 6.5
##  $ WinterRain : int  717 578
##  $ AGST       : num  16.2 16
##  $ HarvestRain: int  122 74
##  $ Age        : int  4 3
##  $ FrancePop  : num  54836 55110
summary(wineTest)
##       Year          Price         WinterRain         AGST      
##  Min.   :1979   Min.   :6.498   Min.   :578.0   Min.   :16.00  
##  1st Qu.:1979   1st Qu.:6.612   1st Qu.:612.8   1st Qu.:16.04  
##  Median :1980   Median :6.726   Median :647.5   Median :16.08  
##  Mean   :1980   Mean   :6.726   Mean   :647.5   Mean   :16.08  
##  3rd Qu.:1980   3rd Qu.:6.840   3rd Qu.:682.2   3rd Qu.:16.13  
##  Max.   :1980   Max.   :6.954   Max.   :717.0   Max.   :16.17  
##   HarvestRain       Age         FrancePop    
##  Min.   : 74   Min.   :3.00   Min.   :54836  
##  1st Qu.: 86   1st Qu.:3.25   1st Qu.:54904  
##  Median : 98   Median :3.50   Median :54973  
##  Mean   : 98   Mean   :3.50   Mean   :54973  
##  3rd Qu.:110   3rd Qu.:3.75   3rd Qu.:55042  
##  Max.   :122   Max.   :4.00   Max.   :55110
# To make predictions for these two test points, R offers the "predict()" function.
# We are going to use our previous model no. 5.
predictTest <- predict(model5, newdata = wineTest)

predictTest
##        1        2 
## 6.768925 6.684910

Inference : Let’s understand what our output is trying to convey info.

  • Prediction price values –> closely matches to price values of test data.

  • But we can quantify this by computing the R-squared value for our test set.

  • Forumuala for R-squared is 1 - SSE / SST

SSE <- sum((wineTest$Price - predictTest)^2)
SST <- sum((wineTest$Price - mean(wine$Price))^2)

# R ^ 2
1 - SSE / SST
## [1] 0.7944278
# Are we done? Is that the best we can do?
    # It turns out our test set is very small; we'd need more data for a reliable "out-of-sample" accuracy estimate.

Out-of-Sample R^2

  • Model R-squared increases as we add more independent variables.

  • However, this is not true for test-set R-squared.

  • Further, we need more data to be conclusive.

  • Out-of-sample R-squared can even be negative.
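To see how that negative value can arise, here is a minimal sketch with made-up numbers: whenever the model's test-set predictions are worse than simply predicting the training-set mean, SSE exceeds SST and 1 - SSE/SST drops below zero.

```r
# Hypothetical (illustrative only) test-set error sums:
SSE <- 3.0   # squared errors of the model's test predictions
SST <- 2.0   # squared errors of the training-mean baseline
1 - SSE / SST
## [1] -0.5
```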


Sports Analytics

Moneyball

In this R session, we are going to make use of the power of modeling, and see how the Oakland A's key decision makers used it to make it through to the playoffs.


The Story of Moneyball

Here are some observations from the above scatterplot:

  • Blue dot –> Team : N.Y. Yankees, avg. wins := 97, avg. annual salary := $90M.

  • Red dot –> Team : Boston Red Sox, avg. wins := 90 (approx.), avg. annual salary := $80M.

  • Green dot –> Team : Oakland A's, avg. wins := 90 (approx.), avg. annual salary := $30M.

The Oakland A's were doing it cost-effectively, if not better than the "all-star teams".


Competing as a Poor Team


Given that Oak.A’s were a poor team & can’t afford “the all-stars”, but they are still making it to the plaoffs. The question was How?

  • Apparently they took a quantitative approach rather than intitutive, emotional & physical!

  • There key policies were as follows:

    * #### Finding undervalued players through each player's indiv. statistics.
    
    * #### If they do, find players at a bargain.
    
    * #### Make them to play and hope they played well, to make it to playoffs.
  • Thus, Creating a team with a good albiet surprisingly mixed performance players.

Comedy of Errors


Making it to the playoffs


Let’s see how a baseball team make it to the playoffs:

  • First, Use Analytics.

  • Second, Predict which team will make it to playoffs.

    • How? By knowing how many games they won in the 1st regular season.

    • Using the difference b/w “Runs Scored” & “Opponent Runs”.

  • Third, Building a Linear Regression model to predict the former case.

“Essentially, Paul Reduced the regular season to a math problem”.

  • He judged, making to playoffs –> 95 games to be won!


Validating the claim: Winning 95 Games


Let’s see if we can verify this using data visulization.

Further, to figure this out, we need to answer the following questions:

  • How does a team win games?

  • Score > Opponent.

  • But question is, How many more?

“Turns out A’s calculated that they required to score 135 or more.”

Let’s validate using the data in R!

# Load the dataset.
baseball <- read.csv("C:/Users/pySag/Desktop/MITx15.071x/Datasets/baseball.csv")

# Structure
str(baseball)
## 'data.frame':    1232 obs. of  15 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 2 3 4 5 7 8 9 10 11 12 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
##  $ Year        : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ RS          : int  734 700 712 734 613 748 669 667 758 726 ...
##  $ RA          : int  688 600 705 806 759 676 588 845 890 670 ...
##  $ W           : int  81 94 93 69 61 85 97 68 64 88 ...
##  $ OBP         : num  0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
##  $ SLG         : num  0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
##  $ BA          : num  0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
##  $ Playoffs    : int  0 1 1 0 0 0 1 0 0 1 ...
##  $ RankSeason  : int  NA 4 5 NA NA NA 2 NA NA 6 ...
##  $ RankPlayoffs: int  NA 5 4 NA NA NA 4 NA NA 2 ...
##  $ G           : int  162 162 162 162 162 162 162 162 162 162 ...
##  $ OOBP        : num  0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
##  $ OSLG        : num  0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
# A bit about the dataset:
  
  # Observations for every team-and-year pair from 1962 to 2012.
  # For all seasons with 162 games.
  # 15 variables, including:

      # RS : Runs Scored.
      # RA : Runs Allowed.
      # W  : No. of Wins.

  # & several more, but for this session these will suffice.

# Subsetting our data to the years before 2002, the data Paul used.
moneyball <- subset(baseball, Year < 2002)
str(moneyball)
## 'data.frame':    902 obs. of  15 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
##  $ Year        : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ RS          : int  691 818 729 687 772 777 798 735 897 923 ...
##  $ RA          : int  730 677 643 829 745 701 795 850 821 906 ...
##  $ W           : int  75 92 88 63 82 88 83 66 91 73 ...
##  $ OBP         : num  0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
##  $ SLG         : num  0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
##  $ BA          : num  0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
##  $ Playoffs    : int  0 1 1 0 0 0 0 0 1 0 ...
##  $ RankSeason  : int  NA 5 7 NA NA NA NA NA 6 NA ...
##  $ RankPlayoffs: int  NA 1 3 NA NA NA NA NA 4 NA ...
##  $ G           : int  162 162 162 162 161 162 162 162 162 162 ...
##  $ OOBP        : num  0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
##  $ OSLG        : num  0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
# Calculating the difference
moneyball$RD <- moneyball$RS - moneyball$RA
str(moneyball)
## 'data.frame':    902 obs. of  16 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
##  $ Year        : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ RS          : int  691 818 729 687 772 777 798 735 897 923 ...
##  $ RA          : int  730 677 643 829 745 701 795 850 821 906 ...
##  $ W           : int  75 92 88 63 82 88 83 66 91 73 ...
##  $ OBP         : num  0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
##  $ SLG         : num  0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
##  $ BA          : num  0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
##  $ Playoffs    : int  0 1 1 0 0 0 0 0 1 0 ...
##  $ RankSeason  : int  NA 5 7 NA NA NA NA NA 6 NA ...
##  $ RankPlayoffs: int  NA 1 3 NA NA NA NA NA 4 NA ...
##  $ G           : int  162 162 162 162 161 162 162 162 162 162 ...
##  $ OOBP        : num  0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
##  $ OSLG        : num  0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
##  $ RD          : int  -39 141 86 -142 27 76 3 -115 76 17 ...
# Visualization
plot(moneyball$RD, moneyball$W, xlab = "Run difference", ylab = "No. of Wins")
title("RD vs Wins")

# Linear Regression Model
WinsReg <- lm( W ~ RD, data = moneyball )
summary(WinsReg)
## 
## Call:
## lm(formula = W ~ RD, data = moneyball)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2662  -2.6509   0.1234   2.9364  11.6570 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 80.881375   0.131157  616.67   <2e-16 ***
## RD           0.105766   0.001297   81.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.939 on 900 degrees of freedom
## Multiple R-squared:  0.8808, Adjusted R-squared:  0.8807 
## F-statistic:  6651 on 1 and 900 DF,  p-value: < 2.2e-16

[ Inference ]:

Here are the following observations:

  • The variables "RD" & "W" are highly correlated.

  • Both coefficients are highly significant.

  • This is a pretty strong model.


[ Mathematically ]:

  • W := 80.881375 + 0.105766 x RD

  • To make the playoffs we need W >= 95.

  • If we rearrange the terms: 80.881375 + 0.105766 x RD >= 95

  • Solving for RD: RD >= (95 - 80.881375) / 0.105766 ≈ 133.5, thus almost 135.

    • And so the claim of RD >= 135 is consistent with the above equation.
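The same arithmetic can be done directly from the fitted model; a quick sketch using the WinsReg coefficients:

```r
b <- coef(WinsReg)                  # named vector: (Intercept), RD
(95 - b["(Intercept)"]) / b["RD"]   # run differential needed for ~95 wins; ~133.5
```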


Predicting Runs

Now we need to know "how many runs a team will score", which can be predicted with batting statistics, and "how many runs a team will allow", which can be predicted with fielding and pitching statistics.

  1. Scoring Runs.

  • How does a team score more runs?

  • The Oakland A's discovered that two baseball statistics were significantly more predictive of runs scored than the rest:

  • On-Base Percentage (OBP) : the percentage of time a player gets on base.

  • Slugging Percentage (SLG) : how far a player gets around the bases on his turn (measures power).

  • Batting Average (BA) : what most teams focus on.

  2. Most important statistics:

  • OBP, with a significance level of 3 stars.

  • SLG, with a 2-star level of significance.

  • BA, with the least significance.

The question is: can we use all this info in our linear model to verify which baseball stats are more important for predicting runs?

str(moneyball)
## 'data.frame':    902 obs. of  16 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
##  $ Year        : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ RS          : int  691 818 729 687 772 777 798 735 897 923 ...
##  $ RA          : int  730 677 643 829 745 701 795 850 821 906 ...
##  $ W           : int  75 92 88 63 82 88 83 66 91 73 ...
##  $ OBP         : num  0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
##  $ SLG         : num  0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
##  $ BA          : num  0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
##  $ Playoffs    : int  0 1 1 0 0 0 0 0 1 0 ...
##  $ RankSeason  : int  NA 5 7 NA NA NA NA NA 6 NA ...
##  $ RankPlayoffs: int  NA 1 3 NA NA NA NA NA 4 NA ...
##  $ G           : int  162 162 162 162 161 162 162 162 162 162 ...
##  $ OOBP        : num  0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
##  $ OSLG        : num  0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
##  $ RD          : int  -39 141 86 -142 27 76 3 -115 76 17 ...
RunsReg <- lm( RS ~ OBP + SLG + BA, data = moneyball)
summary(RunsReg)
## 
## Call:
## lm(formula = RS ~ OBP + SLG + BA, data = moneyball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.941 -17.247  -0.621  16.754  90.998 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -788.46      19.70 -40.029  < 2e-16 ***
## OBP          2917.42     110.47  26.410  < 2e-16 ***
## SLG          1637.93      45.99  35.612  < 2e-16 ***
## BA           -368.97     130.58  -2.826  0.00482 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.69 on 898 degrees of freedom
## Multiple R-squared:  0.9302, Adjusted R-squared:   0.93 
## F-statistic:  3989 on 3 and 898 DF,  p-value: < 2.2e-16
  1. Inference

  • All independent variables are significant.

  • R-squared := 0.9302

  • The coefficient for BA is negative.

    • This implies that, all else being equal, a team with a lower BA will score more runs.

    • This observation is a little counterintuitive. What's happening?

  • This is a case of multicollinearity: these independent variables are highly correlated with one another.

    • Thus it's hard to interpret the coefficients of our model.

  • Removing BA from our model may therefore help.

RunsReg <- lm( RS ~ OBP + SLG, data = moneyball)
summary(RunsReg)
## 
## Call:
## lm(formula = RS ~ OBP + SLG, data = moneyball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.838 -17.174  -1.108  16.770  90.036 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -804.63      18.92  -42.53   <2e-16 ***
## OBP          2737.77      90.68   30.19   <2e-16 ***
## SLG          1584.91      42.16   37.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.79 on 899 degrees of freedom
## Multiple R-squared:  0.9296, Adjusted R-squared:  0.9294 
## F-statistic:  5934 on 2 and 899 DF,  p-value: < 2.2e-16
  2. New Inference

  • All independent variables are significant.

  • R-squared := 0.9296, which is ~ 0.9302.

  • The only change we can observe is that the coefficients are now all positive.

  • This has improved our model: it's much simpler, yet just as powerful.
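As a usage sketch, the fitted equation RS = -804.63 + 2737.77 x OBP + 1584.91 x SLG can be evaluated for a hypothetical team; the OBP & SLG values below are illustrative, not taken from the dataset:

```r
# predicted runs scored for a hypothetical team with OBP = 0.339, SLG = 0.430
OBP <- 0.339
SLG <- 0.430
-804.63 + 2737.77 * OBP + 1584.91 * SLG
## [1] 804.9853
```

Equivalently, predict(RunsReg, newdata = data.frame(OBP = 0.339, SLG = 0.430)) uses the unrounded coefficients.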


Conclusion

We were able to verify the claims made in Moneyball:

  • That BA is overvalued.

  • While OBP is the most important statistic, followed by SLG.

  • And that runs allowed can be predicted analogously, using :

    • OOBP ( Opponent On-Base Percentage )

    • OSLG ( Opponent Slugging Percentage )

summary(moneyball)
##       Team     League        Year            RS               RA        
##  BAL    : 36   AL:462   Min.   :1962   Min.   : 463.0   Min.   : 472.0  
##  BOS    : 36   NL:440   1st Qu.:1973   1st Qu.: 641.2   1st Qu.: 640.0  
##  CHC    : 36            Median :1983   Median : 695.0   Median : 697.0  
##  CHW    : 36            Mean   :1982   Mean   : 703.8   Mean   : 703.8  
##  CIN    : 36            3rd Qu.:1992   3rd Qu.: 761.8   3rd Qu.: 763.0  
##  CLE    : 36            Max.   :2001   Max.   :1009.0   Max.   :1103.0  
##  (Other):686                                                            
##        W               OBP             SLG               BA        
##  Min.   : 40.00   Min.   :0.277   Min.   :0.3010   Min.   :0.2140  
##  1st Qu.: 73.00   1st Qu.:0.314   1st Qu.:0.3680   1st Qu.:0.2500  
##  Median : 81.00   Median :0.324   Median :0.3880   Median :0.2580  
##  Mean   : 80.88   Mean   :0.325   Mean   :0.3904   Mean   :0.2582  
##  3rd Qu.: 89.00   3rd Qu.:0.335   3rd Qu.:0.4118   3rd Qu.:0.2670  
##  Max.   :116.00   Max.   :0.373   Max.   :0.4850   Max.   :0.2940  
##                                                                    
##     Playoffs        RankSeason     RankPlayoffs         G        
##  Min.   :0.0000   Min.   :1.000   Min.   :1.000   Min.   :158.0  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:162.0  
##  Median :0.0000   Median :2.500   Median :3.000   Median :162.0  
##  Mean   :0.1707   Mean   :2.792   Mean   :2.454   Mean   :161.9  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:162.0  
##  Max.   :1.0000   Max.   :8.000   Max.   :4.000   Max.   :165.0  
##                   NA's   :748     NA's   :748                    
##       OOBP             OSLG              RD         
##  Min.   :0.3010   Min.   :0.3770   Min.   :-331.00  
##  1st Qu.:0.3290   1st Qu.:0.4160   1st Qu.: -70.75  
##  Median :0.3420   Median :0.4325   Median :   3.00  
##  Mean   :0.3405   Mean   :0.4325   Mean   :   0.00  
##  3rd Qu.:0.3500   3rd Qu.:0.4508   3rd Qu.:  69.75  
##  Max.   :0.3840   Max.   :0.4990   Max.   : 309.00  
##  NA's   :812      NA's   :812
RunsAll <- lm( RA ~ OOBP + OSLG, data = moneyball )
summary(RunsAll)
## 
## Call:
## lm(formula = RA ~ OOBP + OSLG, data = moneyball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.397 -15.178  -0.129  17.679  60.955 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -837.38      60.26 -13.897  < 2e-16 ***
## OOBP         2913.60     291.97   9.979 4.46e-16 ***
## OSLG         1514.29     175.43   8.632 2.55e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.67 on 87 degrees of freedom
##   (812 observations deleted due to missingness)
## Multiple R-squared:  0.9073, Adjusted R-squared:  0.9052 
## F-statistic: 425.8 on 2 and 87 DF,  p-value: < 2.2e-16
  • We get the linear regression model:

    • RA := -837.38 + 2913.60( OOBP ) + 1514.29( OSLG )

  • R-squared := 0.91

  • Both variables are significant.
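As a rough sanity check on the RA equation: opponent stats of about OOBP ≈ 0.307 and OSLG ≈ 0.373 (assumed values for illustration, not listed in these notes) reproduce the RA ≈ 622 used in the prediction section further down.

```r
# Sketch: plug assumed opponent stats into the fitted runs-allowed equation
OOBP <- 0.307   # assumed value, not from these notes
OSLG <- 0.373   # assumed value, not from these notes
RA   <- -837.38 + 2913.60 * OOBP + 1514.29 * OSLG
round(RA)       # ~622
```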

#### Key : Simple models using only a couple of independent variables can be constructed to answer some of the most important questions in sports

Using the Models to Make Predictions


  • Can we predict how many games the 2002 Oak.A's will win using our models?

  • The models for runs use team statistics.

  • Each year a baseball team is different.

  • But we can estimate the new team statistics using past player performance.

  • This assumes past performance correlates with future performance.

  • We can estimate the team statistics for 2002 using the 2001 player stats.


[ Predicting Runs Scored ]:

  • At the beginning of the 2002 season, the Oak.A’s had 24 batters.

  • Using the 2001 regular season stats for these players:

    • Team OBP := 0.339

    • Team SLG := 0.430

  • Our Regression equation was:

    • RS := -804.63 + 2737.77( OBP ) + 1584.91( SLG )

    • All we have to do is substitute the values:

    • RS := -804.63 + 2737.77( 0.339 ) + 1584.91( 0.430 ) = 805
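The substitution above can be checked directly in R:

```r
# Plug the estimated 2002 team stats into the fitted runs-scored equation
OBP <- 0.339
SLG <- 0.430
RS  <- -804.63 + 2737.77 * OBP + 1584.91 * SLG
round(RS)   # 805
```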


[ Predicting Wins ]

  • Our regression equation to predict wins was:

    • Wins := 80.8814 + 0.1058( RS - RA )

  • We predicted:

    • RS := 805

    • RA := 622

  • So our prediction for wins is:

    • Wins := 80.8814 + 0.1058( 805 - 622 )

    • := 100
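The wins calculation, done in R:

```r
# Predicted wins from the predicted run differential
RS   <- 805
RA   <- 622
Wins <- 80.8814 + 0.1058 * (RS - RA)
round(Wins)   # 100
```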

  • Paul DePodesta used a similar approach to make predictions.

  • Predictions closely matched actual performance:

|              | Our Prediction | Paul's Prediction | Actual |
| ------------ | -------------- | ----------------- | ------ |
| Runs Scored  | 805            | 800 - 820         | 800    |
| Runs Allowed | 622            | 650 - 670         | 653    |
| Wins         | 100            | 93 - 97           | 103    |


[ Luck in the Playoffs ]

Billy & Paul see their jobs as making sure the team makes it to the playoffs.

  • Oak.A’s made it to playoffs in 2000, 2001, 2002, 2003

  • They didn’t win the World Series!

  • Why?

    • It turns out their further analysis was plagued by a statistical phenomenon known as the “sample size problem”.

[ Is Playoff Performance Predictable? ]

  • Using data from 1994-2011 ( 8 teams make the playoffs each year ).

  • We shall use these 8 teams.

  • The correlation b/w winning the World Series & regular season wins is 0.03.

    • The correlation is very low.

    • So winning regular season games gets you to the playoffs, but it says little about playoff success.

  • Logistic regression can be used to predict whether or not a team will win the World Series.

Quick Question : finding correlations

# Create vectors
teamRank <- c( 1,2,3,3,4,4,4,4,5,5 )
wins2012 <- c(94, 88, 95, 88, 93, 94, 98, 97, 93, 94)
wins2013 <- c(97, 97, 92, 93, 92, 96, 94, 96, 92, 90)

# Combine them into a single dataframe
df1 <- data.frame( teamRank, wins2012 )
df2 <- data.frame( teamRank, wins2013 )

# Compute the Pearson correlation coefficient
# (cor() also accepts two vectors directly, e.g. cor(teamRank, wins2012))
cor( df1, use = "complete.obs", method = "pearson")
##           teamRank  wins2012
## teamRank 1.0000000 0.3477129
## wins2012 0.3477129 1.0000000
cor( df2, use = "complete.obs", method = "pearson" )
##            teamRank   wins2013
## teamRank  1.0000000 -0.6556945
## wins2013 -0.6556945  1.0000000

Recitation 2


# Loading our data set
NBA <- read.csv("C:/Users/pySag/Desktop/MITx15.071x/Datasets/NBA_train.csv")

# Structure of our dataset
str(NBA)
## 'data.frame':    835 obs. of  20 variables:
##  $ SeasonEnd: int  1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
##  $ Team     : Factor w/ 37 levels "Atlanta Hawks",..: 1 2 5 6 8 9 10 11 12 13 ...
##  $ Playoffs : int  1 1 0 0 0 0 0 1 0 1 ...
##  $ W        : int  50 61 30 37 30 16 24 41 37 47 ...
##  $ PTS      : int  8573 9303 8813 9360 8878 8933 8493 9084 9119 8860 ...
##  $ oppPTS   : int  8334 8664 9035 9332 9240 9609 8853 9070 9176 8603 ...
##  $ FG       : int  3261 3617 3362 3811 3462 3643 3527 3599 3639 3582 ...
##  $ FGA      : int  7027 7387 6943 8041 7470 7596 7318 7496 7689 7489 ...
##  $ X2P      : int  3248 3455 3292 3775 3379 3586 3500 3495 3551 3557 ...
##  $ X2PA     : int  6952 6965 6668 7854 7215 7377 7197 7117 7375 7375 ...
##  $ X3P      : int  13 162 70 36 83 57 27 104 88 25 ...
##  $ X3PA     : int  75 422 275 187 255 219 121 379 314 114 ...
##  $ FT       : int  2038 1907 2019 1702 1871 1590 1412 1782 1753 1671 ...
##  $ FTA      : int  2645 2449 2592 2205 2539 2149 1914 2326 2333 2250 ...
##  $ ORB      : int  1369 1227 1115 1307 1311 1226 1155 1394 1398 1187 ...
##  $ DRB      : int  2406 2457 2465 2381 2524 2415 2437 2217 2326 2429 ...
##  $ AST      : int  1913 2198 2152 2108 2079 1950 2028 2149 2148 2123 ...
##  $ STL      : int  782 809 704 764 746 783 779 782 900 863 ...
##  $ BLK      : int  539 308 392 342 404 562 339 373 530 356 ...
##  $ TOV      : int  1495 1539 1684 1370 1533 1742 1492 1565 1517 1439 ...
# Summary of our dataset
summary(NBA)
##    SeasonEnd                     Team        Playoffs            W       
##  Min.   :1980   Atlanta Hawks      : 31   Min.   :0.0000   Min.   :11.0  
##  1st Qu.:1989   Boston Celtics     : 31   1st Qu.:0.0000   1st Qu.:31.0  
##  Median :1996   Chicago Bulls      : 31   Median :1.0000   Median :42.0  
##  Mean   :1996   Cleveland Cavaliers: 31   Mean   :0.5749   Mean   :41.0  
##  3rd Qu.:2005   Denver Nuggets     : 31   3rd Qu.:1.0000   3rd Qu.:50.5  
##  Max.   :2011   Detroit Pistons    : 31   Max.   :1.0000   Max.   :72.0  
##                 (Other)            :649                                  
##       PTS            oppPTS            FG            FGA      
##  Min.   : 6901   Min.   : 6909   Min.   :2565   Min.   :5972  
##  1st Qu.: 7934   1st Qu.: 7934   1st Qu.:2974   1st Qu.:6564  
##  Median : 8312   Median : 8365   Median :3150   Median :6831  
##  Mean   : 8370   Mean   : 8370   Mean   :3200   Mean   :6873  
##  3rd Qu.: 8784   3rd Qu.: 8768   3rd Qu.:3434   3rd Qu.:7157  
##  Max.   :10371   Max.   :10723   Max.   :3980   Max.   :8868  
##                                                               
##       X2P            X2PA           X3P             X3PA       
##  Min.   :1981   Min.   :4153   Min.   : 10.0   Min.   :  75.0  
##  1st Qu.:2510   1st Qu.:5269   1st Qu.:131.5   1st Qu.: 413.0  
##  Median :2718   Median :5706   Median :329.0   Median : 942.0  
##  Mean   :2881   Mean   :5956   Mean   :319.0   Mean   : 916.9  
##  3rd Qu.:3296   3rd Qu.:6754   3rd Qu.:481.5   3rd Qu.:1347.5  
##  Max.   :3954   Max.   :7873   Max.   :841.0   Max.   :2284.0  
##                                                                
##        FT            FTA            ORB              DRB      
##  Min.   :1189   Min.   :1475   Min.   : 639.0   Min.   :2044  
##  1st Qu.:1502   1st Qu.:2008   1st Qu.: 953.5   1st Qu.:2346  
##  Median :1628   Median :2176   Median :1055.0   Median :2433  
##  Mean   :1650   Mean   :2190   Mean   :1061.6   Mean   :2427  
##  3rd Qu.:1781   3rd Qu.:2352   3rd Qu.:1167.0   3rd Qu.:2516  
##  Max.   :2388   Max.   :3051   Max.   :1520.0   Max.   :2753  
##                                                               
##       AST            STL              BLK             TOV      
##  Min.   :1423   Min.   : 455.0   Min.   :204.0   Min.   : 931  
##  1st Qu.:1735   1st Qu.: 599.0   1st Qu.:359.0   1st Qu.:1192  
##  Median :1899   Median : 658.0   Median :410.0   Median :1289  
##  Mean   :1912   Mean   : 668.4   Mean   :419.8   Mean   :1303  
##  3rd Qu.:2078   3rd Qu.: 729.0   3rd Qu.:469.5   3rd Qu.:1396  
##  Max.   :2575   Max.   :1053.0   Max.   :716.0   Max.   :1873  
## 

How many games does a team need to win to make it to the playoffs?


# Using the table command
table( NBA $ W, NBA $ Playoffs )
##     
##       0  1
##   11  2  0
##   12  2  0
##   13  2  0
##   14  2  0
##   15 10  0
##   16  2  0
##   17 11  0
##   18  5  0
##   19 10  0
##   20 10  0
##   21 12  0
##   22 11  0
##   23 11  0
##   24 18  0
##   25 11  0
##   26 17  0
##   27 10  0
##   28 18  0
##   29 12  0
##   30 19  1
##   31 15  1
##   32 12  0
##   33 17  0
##   34 16  0
##   35 13  3
##   36 17  4
##   37 15  4
##   38  8  7
##   39 10 10
##   40  9 13
##   41 11 26
##   42  8 29
##   43  2 18
##   44  2 27
##   45  3 22
##   46  1 15
##   47  0 28
##   48  1 14
##   49  0 17
##   50  0 32
##   51  0 12
##   52  0 20
##   53  0 17
##   54  0 18
##   55  0 24
##   56  0 16
##   57  0 23
##   58  0 13
##   59  0 14
##   60  0  8
##   61  0 10
##   62  0 13
##   63  0  7
##   64  0  3
##   65  0  3
##   66  0  2
##   67  0  4
##   69  0  1
##   72  0  1
# Interpreting the results:
  
  # Rows are win totals; columns are Playoffs = 0 (missed) & 1 (made it)
  
  # A team that wins ~42 or more games almost always makes it to the playoffs


# Predicting the number of games a team will win.

    # For this, we will use the difference between "points scored" & "points allowed"

NBA $ ptdf <- ( NBA $ PTS - NBA $ oppPTS )

plot( NBA $ W, NBA $ ptdf, xlab = "No of Wins", ylab = "Points difference", col = "red")
title( main = "Scatter Plot: Wins vs Total Point Difference")

# Verification
WinsReg <- lm( W ~ ptdf, data = NBA)
summary(WinsReg)
## 
## Call:
## lm(formula = W ~ ptdf, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7393 -2.1018 -0.0672  2.0265 10.6026 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.100e+01  1.059e-01   387.0   <2e-16 ***
## ptdf        3.259e-02  2.793e-04   116.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.061 on 833 degrees of freedom
## Multiple R-squared:  0.9423, Adjusted R-squared:  0.9423 
## F-statistic: 1.361e+04 on 1 and 833 DF,  p-value: < 2.2e-16
# Regression equation: W := 41 + 0.03259 * ptdf
# To make the playoffs, a team needs ~42 wins; what does that mean in terms of ptdf?
# It would mean: ptdf := ( 42 - 41 ) / 0.03259
# := ~31, i.e. a team needs to score about 31 more points than it allows to win 42 games!
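The back-of-the-envelope inversion above, done in R:

```r
# Invert the regression equation W = 41 + 0.03259 * ptdf for W = 42
intercept   <- 41
slope       <- 0.03259
ptdf_needed <- (42 - intercept) / slope
round(ptdf_needed)   # 31
```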

Predicting points scored


In order to do so:

  • The dependent variable will be PTS.

  • The independent variables will be the box-score statistics: X2PA, X3PA, FTA, AST, ORB, DRB, TOV, STL & BLK.

Let’s Build our model!

PTSreg <- lm( PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + STL + BLK, data = NBA)
summary(PTSreg)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + 
##     STL + BLK, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -527.40 -119.83    7.83  120.67  564.71 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.051e+03  2.035e+02 -10.078   <2e-16 ***
## X2PA         1.043e+00  2.957e-02  35.274   <2e-16 ***
## X3PA         1.259e+00  3.843e-02  32.747   <2e-16 ***
## FTA          1.128e+00  3.373e-02  33.440   <2e-16 ***
## AST          8.858e-01  4.396e-02  20.150   <2e-16 ***
## ORB         -9.554e-01  7.792e-02 -12.261   <2e-16 ***
## DRB          3.883e-02  6.157e-02   0.631   0.5285    
## TOV         -2.475e-02  6.118e-02  -0.405   0.6859    
## STL         -1.992e-01  9.181e-02  -2.169   0.0303 *  
## BLK         -5.576e-02  8.782e-02  -0.635   0.5256    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.5 on 825 degrees of freedom
## Multiple R-squared:  0.8992, Adjusted R-squared:  0.8981 
## F-statistic: 817.3 on 9 and 825 DF,  p-value: < 2.2e-16
# Let's inspect the residuals; table() sorts them, so [1:10] shows the 10 most negative
table( PTSreg $ residuals ) [1:10]
## 
## -527.398893864954 -526.146884817283 -508.325040454119 -502.478571455115 
##                 1                 1                 1                 1 
## -475.167837837594 -452.859686254576 -435.627062120331 -435.535549436342 
##                 1                 1                 1                 1 
## -433.124335835753 -431.134658988933 
##                 1                 1
SSE <- sum( PTSreg $ residuals ^ 2) 
SSE
## [1] 28394314
RMSE <- sqrt( SSE / nrow(NBA))
RMSE
## [1] 184.4049
# Inference:
      
    # "SSE" is hard to interpret on its own; "RMSE" puts the error on the scale of the data.
    # Being off by ~184 points is not so bad...

# ...if we compare it to the average total number of points scored in a season:
mean( NBA $ PTS )
## [1] 8370.24
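To put the RMSE in context, a quick sketch using the two numbers printed above:

```r
# RMSE as a fraction of the average total points scored in a season
RMSE   <- 184.4049
avgPTS <- 8370.24
round(RMSE / avgPTS, 3)   # 0.022, i.e. roughly a 2% error
```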

Improving our Model


Let’s remove some of the independent variables from our linear regression model to check whether we can still improve it by a fraction.

We will remove these independent variables one at a time.

Which variable to remove first depends on its p-value:

  • if the p-value is high, the variable is the least statistically significant.

summary(PTSreg)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + 
##     STL + BLK, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -527.40 -119.83    7.83  120.67  564.71 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.051e+03  2.035e+02 -10.078   <2e-16 ***
## X2PA         1.043e+00  2.957e-02  35.274   <2e-16 ***
## X3PA         1.259e+00  3.843e-02  32.747   <2e-16 ***
## FTA          1.128e+00  3.373e-02  33.440   <2e-16 ***
## AST          8.858e-01  4.396e-02  20.150   <2e-16 ***
## ORB         -9.554e-01  7.792e-02 -12.261   <2e-16 ***
## DRB          3.883e-02  6.157e-02   0.631   0.5285    
## TOV         -2.475e-02  6.118e-02  -0.405   0.6859    
## STL         -1.992e-01  9.181e-02  -2.169   0.0303 *  
## BLK         -5.576e-02  8.782e-02  -0.635   0.5256    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.5 on 825 degrees of freedom
## Multiple R-squared:  0.8992, Adjusted R-squared:  0.8981 
## F-statistic: 817.3 on 9 and 825 DF,  p-value: < 2.2e-16
# removing TOV
PTSreg2 <- lm( PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL + BLK, data = NBA)

summary(PTSreg2)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL + 
##     BLK, data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -526.79 -121.09    6.37  120.74  565.94 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.077e+03  1.931e+02 -10.755   <2e-16 ***
## X2PA         1.044e+00  2.951e-02  35.366   <2e-16 ***
## X3PA         1.263e+00  3.703e-02  34.099   <2e-16 ***
## FTA          1.125e+00  3.308e-02  34.023   <2e-16 ***
## AST          8.861e-01  4.393e-02  20.173   <2e-16 ***
## ORB         -9.581e-01  7.758e-02 -12.350   <2e-16 ***
## DRB          3.892e-02  6.154e-02   0.632   0.5273    
## STL         -2.068e-01  8.984e-02  -2.301   0.0216 *  
## BLK         -5.863e-02  8.749e-02  -0.670   0.5029    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.4 on 826 degrees of freedom
## Multiple R-squared:  0.8991, Adjusted R-squared:  0.8982 
## F-statistic: 920.4 on 8 and 826 DF,  p-value: < 2.2e-16
# removing DRB
PTSreg3 <- lm( PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK, data = NBA)

summary( PTSreg3 )
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK, 
##     data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -523.79 -121.64    6.07  120.81  573.64 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.015e+03  1.670e+02 -12.068  < 2e-16 ***
## X2PA         1.048e+00  2.852e-02  36.753  < 2e-16 ***
## X3PA         1.271e+00  3.475e-02  36.568  < 2e-16 ***
## FTA          1.128e+00  3.270e-02  34.506  < 2e-16 ***
## AST          8.909e-01  4.326e-02  20.597  < 2e-16 ***
## ORB         -9.702e-01  7.519e-02 -12.903  < 2e-16 ***
## STL         -2.276e-01  8.356e-02  -2.724  0.00659 ** 
## BLK         -3.882e-02  8.165e-02  -0.475  0.63462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.4 on 827 degrees of freedom
## Multiple R-squared:  0.8991, Adjusted R-squared:  0.8982 
## F-statistic:  1053 on 7 and 827 DF,  p-value: < 2.2e-16
# removing BLK (note: DRB reappears in this formula, unlike in PTSreg3)
PTSreg4 <- lm( PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL, data = NBA)
summary( PTSreg4)
## 
## Call:
## lm(formula = PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL, 
##     data = NBA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -525.04 -121.76    7.95  118.43  561.78 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.076e+03  1.930e+02 -10.756   <2e-16 ***
## X2PA         1.048e+00  2.888e-02  36.272   <2e-16 ***
## X3PA         1.269e+00  3.590e-02  35.339   <2e-16 ***
## FTA          1.125e+00  3.306e-02  34.028   <2e-16 ***
## AST          8.847e-01  4.386e-02  20.172   <2e-16 ***
## ORB         -9.681e-01  7.611e-02 -12.720   <2e-16 ***
## DRB          2.415e-02  5.744e-02   0.420   0.6743    
## STL         -2.137e-01  8.921e-02  -2.395   0.0168 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.4 on 827 degrees of freedom
## Multiple R-squared:  0.8991, Adjusted R-squared:  0.8982 
## F-statistic:  1053 on 7 and 827 DF,  p-value: < 2.2e-16
# SSE (note: this sums residuals from the original model PTSreg;
# for the reduced model, use PTSreg4 $ residuals instead)
SSE4 <- sum( PTSreg $ residuals ^ 2 )
SSE4
## [1] 28394314
RMSE <- sqrt( SSE4 / nrow(NBA) )
RMSE
## [1] 184.4049

Predicting 2012-2013 season

For this, we will require the test set “NBA_test”.

# Load the test set
NBAtest <- read.csv("C:/Users/pySag/Desktop/MITx15.071x/Datasets/NBA_test.csv")

pointsPrediction <- predict( PTSreg4, newdata = NBAtest)

# How good is our prediction?
# We can compute the "out-of-sample R-squared"

    # This is a measure of "how well the model predicts on test data"

# Recall, our "in-sample R-squared" := 0.8991
  
    # a measure of "how well the model fits the training data"

# SSE := sum of squared differences b/w predictions & actual values
SSE <- sum( ( pointsPrediction - NBAtest $ PTS) ^ 2 )
SSE
## [1] 1073074
# SST
SST <- sum( ( mean( NBA $ PTS ) - NBAtest $ PTS ) ^ 2 )
R2 <- 1 - SSE / SST
R2
## [1] 0.8138702
RMSE <- sqrt( SSE / nrow(NBAtest) )
RMSE
## [1] 195.7653

Assignment 2

CLIMATE CHANGE

In this problem, we will attempt to study the relationship b/w average global temperature and several other factors.

The file climate_change.csv contains data from May 1983 to Dec 2008. The available variables include:

  • The observation year.

  • The observation month.

  • Temp

Creating our First Model

# loading our dataset
GBtemp <- read.csv("C:/Users/pySag/Desktop/MITx15.071x/Datasets/climate_change.csv")

# str of our dataset
str(GBtemp)
## 'data.frame':    308 obs. of  11 variables:
##  $ Year    : int  1983 1983 1983 1983 1983 1983 1983 1983 1984 1984 ...
##  $ Month   : int  5 6 7 8 9 10 11 12 1 2 ...
##  $ MEI     : num  2.556 2.167 1.741 1.13 0.428 ...
##  $ CO2     : num  346 346 344 342 340 ...
##  $ CH4     : num  1639 1634 1633 1631 1648 ...
##  $ N2O     : num  304 304 304 304 304 ...
##  $ CFC.11  : num  191 192 193 194 194 ...
##  $ CFC.12  : num  350 352 354 356 357 ...
##  $ TSI     : num  1366 1366 1366 1366 1366 ...
##  $ Aerosols: num  0.0863 0.0794 0.0731 0.0673 0.0619 0.0569 0.0524 0.0486 0.0451 0.0416 ...
##  $ Temp    : num  0.109 0.118 0.137 0.176 0.149 0.093 0.232 0.078 0.089 0.013 ...
# summary our dataset
summary(GBtemp)
##       Year          Month             MEI               CO2       
##  Min.   :1983   Min.   : 1.000   Min.   :-1.6350   Min.   :340.2  
##  1st Qu.:1989   1st Qu.: 4.000   1st Qu.:-0.3987   1st Qu.:353.0  
##  Median :1996   Median : 7.000   Median : 0.2375   Median :361.7  
##  Mean   :1996   Mean   : 6.552   Mean   : 0.2756   Mean   :363.2  
##  3rd Qu.:2002   3rd Qu.:10.000   3rd Qu.: 0.8305   3rd Qu.:373.5  
##  Max.   :2008   Max.   :12.000   Max.   : 3.0010   Max.   :388.5  
##       CH4            N2O            CFC.11          CFC.12     
##  Min.   :1630   Min.   :303.7   Min.   :191.3   Min.   :350.1  
##  1st Qu.:1722   1st Qu.:308.1   1st Qu.:246.3   1st Qu.:472.4  
##  Median :1764   Median :311.5   Median :258.3   Median :528.4  
##  Mean   :1750   Mean   :312.4   Mean   :252.0   Mean   :497.5  
##  3rd Qu.:1787   3rd Qu.:317.0   3rd Qu.:267.0   3rd Qu.:540.5  
##  Max.   :1814   Max.   :322.2   Max.   :271.5   Max.   :543.8  
##       TSI          Aerosols            Temp        
##  Min.   :1365   Min.   :0.00160   Min.   :-0.2820  
##  1st Qu.:1366   1st Qu.:0.00280   1st Qu.: 0.1217  
##  Median :1366   Median :0.00575   Median : 0.2480  
##  Mean   :1366   Mean   :0.01666   Mean   : 0.2568  
##  3rd Qu.:1366   3rd Qu.:0.01260   3rd Qu.: 0.4073  
##  Max.   :1367   Max.   :0.14940   Max.   : 0.7390
# subset our dataset into two:

    # 1. A training set.

    # 2. A test set.

GBtemp_Training <- subset(GBtemp, Year < 2007 ) # training set

GBtemp.Test <- subset(GBtemp, Year > 2006 ) # test set

# Building our linear regression model
tempModel1 <- lm( Temp ~ . -Year -Month, data = GBtemp_Training )

summary(tempModel1)
## 
## Call:
## lm(formula = Temp ~ . - Year - Month, data = GBtemp_Training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25888 -0.05913 -0.00082  0.05649  0.32433 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.246e+02  1.989e+01  -6.265 1.43e-09 ***
## MEI          6.421e-02  6.470e-03   9.923  < 2e-16 ***
## CO2          6.457e-03  2.285e-03   2.826  0.00505 ** 
## CH4          1.240e-04  5.158e-04   0.240  0.81015    
## N2O         -1.653e-02  8.565e-03  -1.930  0.05467 .  
## CFC.11      -6.631e-03  1.626e-03  -4.078 5.96e-05 ***
## CFC.12       3.808e-03  1.014e-03   3.757  0.00021 ***
## TSI          9.314e-02  1.475e-02   6.313 1.10e-09 ***
## Aerosols    -1.538e+00  2.133e-01  -7.210 5.41e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09171 on 275 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7436 
## F-statistic: 103.6 on 8 and 275 DF,  p-value: < 2.2e-16
# (1.2) Which variables are significant in the model? We will consider a variable significant only if its p-value is below 0.05.

# MEI, CO2, CFC.11, CFC.12, TSI, Aerosols

Understanding the model

# (2.1) We can find if there is a correlation between N2O & CFC11
# let's create a new data frame.

df3 <- data.frame( GBtemp_Training $ N2O, GBtemp_Training $ CFC.11 )
str(df3)
## 'data.frame':    284 obs. of  2 variables:
##  $ GBtemp_Training.N2O   : num  304 304 304 304 304 ...
##  $ GBtemp_Training.CFC.11: num  191 192 193 194 194 ...
# Both N2O and CFC.11 are greenhouse gases, yet their model coefficients are negative,
# suggesting they lower temperature. If these variables are highly correlated with the
# other predictors, then multicollinearity, rather than an actual cooling effect, is
# the likely explanation for the negative signs.

cor(df3, use = "complete.obs", method = "pearson")
##                        GBtemp_Training.N2O GBtemp_Training.CFC.11
## GBtemp_Training.N2O              1.0000000              0.5224773
## GBtemp_Training.CFC.11           0.5224773              1.0000000
# Inference: N2O and CFC.11 are moderately correlated (0.52), consistent with multicollinearity.

# (2.2) Compute the correlation b/w all the variables in the training set.
# With which of the other independent variables is N2O highly correlated?

    # Absolute correlation > 0.7

# Note: binding N2O on again just duplicates a column;
# cor( GBtemp_Training ) alone would give the same information.
temp1 <- data.frame( GBtemp_Training )
temp2 <- data.frame( GBtemp_Training $ N2O )


temp3 <- data.frame( temp1, temp2 )
cor( temp3, use = "complete.obs", method = "pearson")
##                            Year         Month           MEI         CO2
## Year                 1.00000000 -0.0279419602 -0.0369876842  0.98274939
## Month               -0.02794196  1.0000000000  0.0008846905 -0.10673246
## MEI                 -0.03698768  0.0008846905  1.0000000000 -0.04114717
## CO2                  0.98274939 -0.1067324607 -0.0411471651  1.00000000
## CH4                  0.91565945  0.0185686624 -0.0334193014  0.87727963
## N2O                  0.99384523  0.0136315303 -0.0508197755  0.97671982
## CFC.11               0.56910643 -0.0131112236  0.0690004387  0.51405975
## CFC.12               0.89701166  0.0006751102  0.0082855443  0.85268963
## TSI                  0.17030201 -0.0346061935 -0.1544919227  0.17742893
## Aerosols            -0.34524670  0.0148895406  0.3402377871 -0.35615480
## Temp                 0.78679714 -0.0998567411  0.1724707512  0.78852921
## GBtemp_Training.N2O  0.99384523  0.0136315303 -0.0508197755  0.97671982
##                             CH4         N2O      CFC.11        CFC.12
## Year                 0.91565945  0.99384523  0.56910643  0.8970116635
## Month                0.01856866  0.01363153 -0.01311122  0.0006751102
## MEI                 -0.03341930 -0.05081978  0.06900044  0.0082855443
## CO2                  0.87727963  0.97671982  0.51405975  0.8526896272
## CH4                  1.00000000  0.89983864  0.77990402  0.9636162478
## N2O                  0.89983864  1.00000000  0.52247732  0.8679307757
## CFC.11               0.77990402  0.52247732  1.00000000  0.8689851828
## CFC.12               0.96361625  0.86793078  0.86898518  1.0000000000
## TSI                  0.24552844  0.19975668  0.27204596  0.2553028138
## Aerosols            -0.26780919 -0.33705457 -0.04392120 -0.2251312440
## Temp                 0.70325502  0.77863893  0.40771029  0.6875575483
## GBtemp_Training.N2O  0.89983864  1.00000000  0.52247732  0.8679307757
##                             TSI    Aerosols        Temp
## Year                 0.17030201 -0.34524670  0.78679714
## Month               -0.03460619  0.01488954 -0.09985674
## MEI                 -0.15449192  0.34023779  0.17247075
## CO2                  0.17742893 -0.35615480  0.78852921
## CH4                  0.24552844 -0.26780919  0.70325502
## N2O                  0.19975668 -0.33705457  0.77863893
## CFC.11               0.27204596 -0.04392120  0.40771029
## CFC.12               0.25530281 -0.22513124  0.68755755
## TSI                  1.00000000  0.05211651  0.24338269
## Aerosols             0.05211651  1.00000000 -0.38491375
## Temp                 0.24338269 -0.38491375  1.00000000
## GBtemp_Training.N2O  0.19975668 -0.33705457  0.77863893
##                     GBtemp_Training.N2O
## Year                         0.99384523
## Month                        0.01363153
## MEI                         -0.05081978
## CO2                          0.97671982
## CH4                          0.89983864
## N2O                          1.00000000
## CFC.11                       0.52247732
## CFC.12                       0.86793078
## TSI                          0.19975668
## Aerosols                    -0.33705457
## Temp                         0.77863893
## GBtemp_Training.N2O          1.00000000
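Reading the answer off the matrix above (a sketch assuming GBtemp_Training is loaded as in the earlier chunks): besides Year, which we excluded from the model, and Temp, the dependent variable, the independent variables with absolute correlation above 0.7 with N2O are CO2, CH4 & CFC.12.

```r
# Extract N2O's row of the correlation matrix and filter on |cor| > 0.7
n2o_cor <- cor( GBtemp_Training )["N2O", ]
n2o_cor[ abs(n2o_cor) > 0.7 & names(n2o_cor) != "N2O" ]
```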

3. Simplifying the model

Given that the correlations are so high, let us focus on the N2O variable and build a model with only MEI, TSI, Aerosols and N2O as independent variables.

# Problem 3
# For this, we are going to use our "training set" to build the model
newModel <- lm( Temp ~ MEI + TSI + Aerosols + N2O, data = GBtemp_Training )
summary(newModel)
## 
## Call:
## lm(formula = Temp ~ MEI + TSI + Aerosols + N2O, data = GBtemp_Training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27916 -0.05975 -0.00595  0.05672  0.34195 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.162e+02  2.022e+01  -5.747 2.37e-08 ***
## MEI          6.419e-02  6.652e-03   9.649  < 2e-16 ***
## TSI          7.949e-02  1.487e-02   5.344 1.89e-07 ***
## Aerosols    -1.702e+00  2.180e-01  -7.806 1.19e-13 ***
## N2O          2.532e-02  1.311e-03  19.307  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09547 on 279 degrees of freedom
## Multiple R-squared:  0.7261, Adjusted R-squared:  0.7222 
## F-statistic: 184.9 on 4 and 279 DF,  p-value: < 2.2e-16

4. Automatically Building the Model

Our simplified model was slightly worse (R-squared dropped from 0.7509 to 0.7261), and dropping combinations of variables by hand is a tedious task, hence the following methods ease this process:

  • R provides a function, step, that automates this process.

  • step also takes care of Multiple R-squared.

  • But it has its trade-offs: the number of independent variables shrinks, which can significantly impact the accuracy of the model.

  • This trade-off is formalized by the AIC ( Akaike Information Criterion ).

  • step takes “one argument” : the name of the initial model.

    • returns : a simpler model!


Use the step function in R to derive a new model

newModel2 <- step( tempModel1 )
## Start:  AIC=-1348.16
## Temp ~ (Year + Month + MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + 
##     TSI + Aerosols) - Year - Month
## 
##            Df Sum of Sq    RSS     AIC
## - CH4       1   0.00049 2.3135 -1350.1
## <none>                  2.3130 -1348.2
## - N2O       1   0.03132 2.3443 -1346.3
## - CO2       1   0.06719 2.3802 -1342.0
## - CFC.12    1   0.11874 2.4318 -1335.9
## - CFC.11    1   0.13986 2.4529 -1333.5
## - TSI       1   0.33516 2.6482 -1311.7
## - Aerosols  1   0.43727 2.7503 -1301.0
## - MEI       1   0.82823 3.1412 -1263.2
## 
## Step:  AIC=-1350.1
## Temp ~ MEI + CO2 + N2O + CFC.11 + CFC.12 + TSI + Aerosols
## 
##            Df Sum of Sq    RSS     AIC
## <none>                  2.3135 -1350.1
## - N2O       1   0.03133 2.3448 -1348.3
## - CO2       1   0.06672 2.3802 -1344.0
## - CFC.12    1   0.13023 2.4437 -1336.5
## - CFC.11    1   0.13938 2.4529 -1335.5
## - TSI       1   0.33500 2.6485 -1313.7
## - Aerosols  1   0.43987 2.7534 -1302.7
## - MEI       1   0.83118 3.1447 -1264.9

Inference:

Some interesting things about step():

  • step() does not address collinearity of the variables.

  • Except that adding a highly correlated variable does not significantly improve R-squared.

    • Result: it does not necessarily produce a very interpretable model, but rather one

      that balances quality & simplicity.

5. Testing on Unseen Data

So how well our model does on “unseen” real world data?

  • Using our previous model from the step function,

  • we calculate temperature predictions for the testing data using the predict function.

predictiveModel <- predict( newModel2, newdata = GBtemp.Test)

# predict() does not produce an R-squared, so we have to calculate it manually.
SSE <- sum( ( predictiveModel - GBtemp.Test $ Temp ) ^ 2 )
SSE
## [1] 0.2176444
summary(GBtemp_Training)
##       Year          Month             MEI               CO2       
##  Min.   :1983   Min.   : 1.000   Min.   :-1.5860   Min.   :340.2  
##  1st Qu.:1989   1st Qu.: 4.000   1st Qu.:-0.3230   1st Qu.:352.3  
##  Median :1995   Median : 7.000   Median : 0.3085   Median :359.9  
##  Mean   :1995   Mean   : 6.556   Mean   : 0.3419   Mean   :361.4  
##  3rd Qu.:2001   3rd Qu.:10.000   3rd Qu.: 0.8980   3rd Qu.:370.6  
##  Max.   :2006   Max.   :12.000   Max.   : 3.0010   Max.   :385.0  
##       CH4            N2O            CFC.11          CFC.12     
##  Min.   :1630   Min.   :303.7   Min.   :191.3   Min.   :350.1  
##  1st Qu.:1716   1st Qu.:307.7   1st Qu.:249.6   1st Qu.:462.5  
##  Median :1759   Median :310.8   Median :260.4   Median :522.1  
##  Mean   :1746   Mean   :311.7   Mean   :252.5   Mean   :494.2  
##  3rd Qu.:1782   3rd Qu.:316.1   3rd Qu.:267.4   3rd Qu.:541.0  
##  Max.   :1808   Max.   :320.5   Max.   :271.5   Max.   :543.8  
##       TSI          Aerosols            Temp        
##  Min.   :1365   Min.   :0.00160   Min.   :-0.2820  
##  1st Qu.:1366   1st Qu.:0.00270   1st Qu.: 0.1180  
##  Median :1366   Median :0.00620   Median : 0.2325  
##  Mean   :1366   Mean   :0.01772   Mean   : 0.2478  
##  3rd Qu.:1366   3rd Qu.:0.01400   3rd Qu.: 0.4065  
##  Max.   :1367   Max.   :0.14940   Max.   : 0.7390
SST <- sum( ( mean( GBtemp_Training $ Temp) - GBtemp.Test $ Temp )^2)
SST
## [1] 0.5860189
# Then the R-squared is:
1 - (SSE / SST )
## [1] 0.6286051
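The SSE/SST computation above generalizes to any model. A small helper (hypothetical name, sketch only), demonstrated on R's built-in mtcars data:

```r
# Out-of-sample R-squared: compare model predictions on the test set
# against a baseline that always predicts the TRAINING-set mean.
outOfSampleR2 <- function(predictions, testActual, trainActual) {
  SSE <- sum((predictions - testActual)^2)
  SST <- sum((mean(trainActual) - testActual)^2)
  1 - SSE / SST
}

# Self-contained demo (first 24 mtcars rows train, last 8 test):
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]
fit <- lm(mpg ~ wt, data = train)
outOfSampleR2(predict(fit, newdata = test), test$mpg, train$mpg)
```

Note that SST uses the training-set mean: the baseline must be something we could have computed before seeing the test data.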

Assignment 2.1

Problem 1


# 1 Loading the data set
pisaTrain <- read.csv("C:/Users/pySag/Desktop/MITx15.071x/Datasets/pisa2009train.csv")

pisaTest <- read.csv("C:/Users/pySag/Desktop/MITx15.071x/Datasets/pisa2009test.csv") 
#1.1 How many students are there in our dataset?
str(pisaTrain)
## 'data.frame':    3663 obs. of  24 variables:
##  $ grade                : int  11 11 9 10 10 10 10 10 9 10 ...
##  $ male                 : int  1 1 1 0 1 1 0 0 0 1 ...
##  $ raceeth              : Factor w/ 7 levels "American Indian/Alaska Native",..: NA 7 7 3 4 3 2 7 7 5 ...
##  $ preschool            : int  NA 0 1 1 1 1 0 1 1 1 ...
##  $ expectBachelors      : int  0 0 1 1 0 1 1 1 0 1 ...
##  $ motherHS             : int  NA 1 1 0 1 NA 1 1 1 1 ...
##  $ motherBachelors      : int  NA 1 1 0 0 NA 0 0 NA 1 ...
##  $ motherWork           : int  1 1 1 1 1 1 1 0 1 1 ...
##  $ fatherHS             : int  NA 1 1 1 1 1 NA 1 0 0 ...
##  $ fatherBachelors      : int  NA 0 NA 0 0 0 NA 0 NA 0 ...
##  $ fatherWork           : int  1 1 1 1 0 1 NA 1 1 1 ...
##  $ selfBornUS           : int  1 1 1 1 1 1 0 1 1 1 ...
##  $ motherBornUS         : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ fatherBornUS         : int  0 1 1 1 0 1 NA 1 1 1 ...
##  $ englishAtHome        : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ computerForSchoolwork: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ read30MinsADay       : int  0 1 0 1 1 0 0 1 0 0 ...
##  $ minutesPerWeekEnglish: int  225 450 250 200 250 300 250 300 378 294 ...
##  $ studentsInEnglish    : int  NA 25 28 23 35 20 28 30 20 24 ...
##  $ schoolHasLibrary     : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ publicSchool         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ urban                : int  1 0 0 1 1 0 1 0 1 0 ...
##  $ schoolSize           : int  673 1173 1233 2640 1095 227 2080 1913 502 899 ...
##  $ readingScore         : num  476 575 555 458 614 ...
# Ans: 3663

#1.2 using tapply() on pisaTrain, what is the average reading test score of males?
tapply( pisaTrain $ readingScore, pisaTrain $ male, mean)
##        0        1 
## 512.9406 483.5325
# Ans: male ~ 483.5325, consequently female ~ 512.9406

#(1.3) Locating missing values
# Which variables are missing data in at least one observation in the training set?

summary(pisaTrain)
##      grade            male                      raceeth    
##  Min.   : 8.00   Min.   :0.0000   White             :2015  
##  1st Qu.:10.00   1st Qu.:0.0000   Hispanic          : 834  
##  Median :10.00   Median :1.0000   Black             : 444  
##  Mean   :10.09   Mean   :0.5111   Asian             : 143  
##  3rd Qu.:10.00   3rd Qu.:1.0000   More than one race: 124  
##  Max.   :12.00   Max.   :1.0000   (Other)           :  68  
##                                   NA's              :  35  
##    preschool      expectBachelors     motherHS    motherBachelors 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:1.00   1st Qu.:0.0000  
##  Median :1.0000   Median :1.0000   Median :1.00   Median :0.0000  
##  Mean   :0.7228   Mean   :0.7859   Mean   :0.88   Mean   :0.3481  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.00   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00   Max.   :1.0000  
##  NA's   :56       NA's   :62       NA's   :97     NA's   :397     
##    motherWork        fatherHS      fatherBachelors    fatherWork    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :1.0000   Median :1.0000   Median :0.0000   Median :1.0000  
##  Mean   :0.7345   Mean   :0.8593   Mean   :0.3319   Mean   :0.8531  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :93       NA's   :245      NA's   :569      NA's   :233     
##    selfBornUS      motherBornUS     fatherBornUS    englishAtHome   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:1.0000  
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.9313   Mean   :0.7725   Mean   :0.7668   Mean   :0.8717  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :69       NA's   :71       NA's   :113      NA's   :71      
##  computerForSchoolwork read30MinsADay   minutesPerWeekEnglish
##  Min.   :0.0000        Min.   :0.0000   Min.   :   0.0       
##  1st Qu.:1.0000        1st Qu.:0.0000   1st Qu.: 225.0       
##  Median :1.0000        Median :0.0000   Median : 250.0       
##  Mean   :0.8994        Mean   :0.2899   Mean   : 266.2       
##  3rd Qu.:1.0000        3rd Qu.:1.0000   3rd Qu.: 300.0       
##  Max.   :1.0000        Max.   :1.0000   Max.   :2400.0       
##  NA's   :65            NA's   :34       NA's   :186          
##  studentsInEnglish schoolHasLibrary  publicSchool        urban       
##  Min.   : 1.0      Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:20.0      1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :25.0      Median :1.0000   Median :1.0000   Median :0.0000  
##  Mean   :24.5      Mean   :0.9676   Mean   :0.9339   Mean   :0.3849  
##  3rd Qu.:30.0      3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :75.0      Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :249       NA's   :143                                       
##    schoolSize    readingScore  
##  Min.   : 100   Min.   :168.6  
##  1st Qu.: 712   1st Qu.:431.7  
##  Median :1212   Median :499.7  
##  Mean   :1369   Mean   :497.9  
##  3rd Qu.:1900   3rd Qu.:566.2  
##  Max.   :6694   Max.   :746.0  
##  NA's   :162
# Ans: Every variable listed with NA's in the summary above
# (all except grade, male, publicSchool, urban, and readingScore).

#(1.4) Removing missing values
# To remove observations with any missing values from pisaTrain and pisaTest:
      # use function na.omit()
# dataframe <- na.omit(dataframe)

pisaTrain <- na.omit(pisaTrain)
pisaTest <- na.omit(pisaTest)

# How many observations are now in the training set?
str(pisaTrain)
## 'data.frame':    2414 obs. of  24 variables:
##  $ grade                : int  11 10 10 10 10 10 10 10 11 9 ...
##  $ male                 : int  1 0 1 0 1 0 0 0 1 1 ...
##  $ raceeth              : Factor w/ 7 levels "American Indian/Alaska Native",..: 7 3 4 7 5 4 7 4 7 7 ...
##  $ preschool            : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ expectBachelors      : int  0 1 0 1 1 1 1 0 1 1 ...
##  $ motherHS             : int  1 0 1 1 1 1 1 0 1 1 ...
##  $ motherBachelors      : int  1 0 0 0 1 0 0 0 0 1 ...
##  $ motherWork           : int  1 1 1 0 1 1 1 0 0 1 ...
##  $ fatherHS             : int  1 1 1 1 0 1 1 0 1 1 ...
##  $ fatherBachelors      : int  0 0 0 0 0 0 1 0 1 1 ...
##  $ fatherWork           : int  1 1 0 1 1 0 1 1 1 1 ...
##  $ selfBornUS           : int  1 1 1 1 1 0 1 0 1 1 ...
##  $ motherBornUS         : int  1 1 1 1 1 0 1 0 1 1 ...
##  $ fatherBornUS         : int  1 1 0 1 1 0 1 0 1 1 ...
##  $ englishAtHome        : int  1 1 1 1 1 0 1 0 1 1 ...
##  $ computerForSchoolwork: int  1 1 1 1 1 0 1 1 1 1 ...
##  $ read30MinsADay       : int  1 1 1 1 0 1 1 1 0 0 ...
##  $ minutesPerWeekEnglish: int  450 200 250 300 294 232 225 270 275 225 ...
##  $ studentsInEnglish    : int  25 23 35 30 24 14 20 25 30 15 ...
##  $ schoolHasLibrary     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ publicSchool         : int  1 1 1 1 1 1 1 1 1 0 ...
##  $ urban                : int  0 1 1 0 0 0 0 1 1 1 ...
##  $ schoolSize           : int  1173 2640 1095 1913 899 1733 149 1400 1988 915 ...
##  $ readingScore         : num  575 458 614 439 466 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:1249] 1 3 6 7 9 11 13 21 29 30 ...
##   .. ..- attr(*, "names")= chr [1:1249] "1" "3" "6" "7" ...
# How many observations are now in the testing set?
str(pisaTest)
## 'data.frame':    990 obs. of  24 variables:
##  $ grade                : int  10 10 10 10 11 10 10 10 10 10 ...
##  $ male                 : int  0 0 0 0 0 1 0 1 1 0 ...
##  $ raceeth              : Factor w/ 7 levels "American Indian/Alaska Native",..: 7 7 1 7 7 4 7 4 7 4 ...
##  $ preschool            : int  1 1 1 1 0 1 0 1 1 1 ...
##  $ expectBachelors      : int  0 1 0 0 0 1 1 0 1 1 ...
##  $ motherHS             : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ motherBachelors      : int  1 0 0 0 1 1 0 0 1 0 ...
##  $ motherWork           : int  1 0 0 1 1 1 0 1 1 1 ...
##  $ fatherHS             : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ fatherBachelors      : int  0 1 0 0 1 0 0 0 1 1 ...
##  $ fatherWork           : int  0 1 0 1 1 1 1 0 1 1 ...
##  $ selfBornUS           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ motherBornUS         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ fatherBornUS         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ englishAtHome        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ computerForSchoolwork: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ read30MinsADay       : int  0 0 1 1 1 1 0 0 0 1 ...
##  $ minutesPerWeekEnglish: int  240 240 240 270 270 350 350 360 350 360 ...
##  $ studentsInEnglish    : int  30 30 30 35 30 25 27 28 25 27 ...
##  $ schoolHasLibrary     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ publicSchool         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ urban                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolSize           : int  808 808 808 808 808 899 899 899 899 899 ...
##  $ readingScore         : num  355 454 405 665 605 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:580] 2 3 4 6 12 16 17 19 22 23 ...
##   .. ..- attr(*, "names")= chr [1:580] "2" "3" "4" "6" ...

Problem 2


Factor variables are variables that take on a discrete set of values, like the “Region” variable in the WHO dataset from the second lecture of Unit 1. This is an unordered factor because there isn’t any natural ordering between the levels. An ordered factor has a natural ordering between the levels (an example would be the classifications “large,” “medium,” and “small”).

# Which of the following variables is an unordered factor with at least 3 levels?
# Ans: raceeth
# Which of the following variables is an ordered factor with at least 3 levels?
# Ans: grade

#(2.2) Unordered factors in regression models
    # To include unordered factors in our linear regression model, we define one level as the reference level.

    # In this way a factor with 'n' levels is replaced by 'n-1' binary variables.
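The n-1 encoding can be seen directly with model.matrix(). A toy sketch (made-up rows, not the actual pisaTrain data):

```r
# Sketch: how R encodes an unordered factor as n-1 binary variables.
raceeth <- factor(c("White", "Black", "Asian", "White", "Hispanic"))
raceeth <- relevel(raceeth, ref = "White")  # choose "White" as reference
model.matrix(~ raceeth)
# Dummy columns appear for Asian, Black, and Hispanic only; the reference
# level "White" corresponds to all three dummies being 0.
```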

# (2.3) 
#Consider again adding our unordered factor raceeth to the regression model with reference level "White".

#For a student who is Asian, the binary variable for Asian would be set to 1, and all remaining race binary variables would be set to 0.

Major Project 2016